Practicing exploratory data analysis with a more complex dataset and in R Markdown. The source data for this exercise is a free dataset from ProPublica http://www.propublica.org.
ProPublica’s description of this dataset says:
“This dataset is a snapshot of the gang database maintained by the sheriff’s office and jail in Cook County, Illinois, along with other law enforcement agencies. Names and other personally identifying details have been removed, but the data includes information about the gender, appearance, gang affiliation, zip code, and race of the individuals listed. Additional information is also included, such as whether the individual wears gang colors or has tattoos, has self-identified their gang involvement, is under probation and more.”
The dataset only contains detail data from the month of June 2018. We’ll see if we can find some interesting patterns from the various variables that describe the dataset.
First, load the data:
cookgangdata <- read_excel("CCSO_ITU_FOIA_Dumke_06112018_DRAFT_v2.0_Age_questions.xlsx")
Now, let’s take an initial look at the data and some summary info:
head(cookgangdata)
## # A tibble: 6 x 25
## Subject_ID Subject_Sex chicago address_state address_zip Subject_Gang_ID
## <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 6 M Gary IN 46409 FBJB Gang (Gar…
## 2 7 M Gary IN 46409 Vice Lords (Pe…
## 3 8 M Gary IN 46403 Vice Lords (Pe…
## 4 10 M Gary IN 46408 Vice Lords (Pe…
## 5 11 M Gary IN 46408 Vice Lords (Pe…
## 6 12 M Gary IN 46409 Vice Lords (Pe…
## # … with 19 more variables: Subject_Height <chr>, Subject_Weight <chr>,
## # Subject_Felon <chr>, Subject_Probation <chr>,
## # Subject_Admits_Gang <chr>, Subject_Wears_Colors <chr>,
## # Subject_Armed <chr>, Subject_Race_ID <chr>,
## # Subject_Eye_Color_ID <chr>, Subject_Hair_Color_ID <chr>,
## # Subject_Create_Date <dttm>, Subject_Approved_Date <dttm>,
## # Subject_Deceased <chr>, `Has the individual been arrested in the
## # company of known criminal gang members for offenses which are
## # consistent with criminal gang activity?` <chr>, `Has the individual
## # been identified by an individual of proven reliability as a criminal
## # gang member?` <chr>, `Has the individual admitted membership in a
## # criminal gang, and was this a credible self-admission made to a law
## # enforcement officer or agent?` <chr>, `Does the individual possess
## # tattoos that a trained law enforcement officer or agent has reasonable
## # suspicion to believe signify gang membership?` <chr>, `Does the
## # individual reside in or frequent a particular criminal gang's area or
## # affect their style of dress, use of hand signs, symbols, or maintain
## # an ongoing relationship with known criminal gang members, and where
## # the law enforcement officer documents reasonable suspicion that the
## # individual is involved in criminal gang-related activity or
## # enterprise?` <chr>, `Age as of 8/6/18` <chr>
#str(cookgangdata)
#summary(cookgangdata)
A couple notes about the data:
Starting with renaming the columns and converting some of the data types:
## Review column name lengths
nchar(colnames(cookgangdata))
## [1] 10 11 7 13 11 15 14 14 13 17 19 20 13 15 20 21 19
## [18] 21 16 141 100 141 144 355 16
## Store the old column names in a vector
oldnames <- colnames(cookgangdata)
## Make the reeeeaaally long column names shorter than 50 characters just to see
## the data frame in the date viewer
colnames(cookgangdata)[nchar(colnames(cookgangdata))>50] <- substr(colnames(cookgangdata)[nchar(colnames(cookgangdata))>30], 1, 50)
## Now really rename those long column names
colnames(cookgangdata)[20] <- c("ArrestedWithGangMembers")
colnames(cookgangdata)[21] <- c("IdentifiedAsGangMember")
colnames(cookgangdata)[22] <- c("SelfAdmittedGangMember")
colnames(cookgangdata)[23] <- c("HasGangTattoos")
colnames(cookgangdata)[24] <- c("LiveInGangAreaOrKnowsGangMembers")
## Change the two date columns to character columns because you can't compare the "NULL"
## string to a date (which I'll do right after this step).
cookgangdata$Subject_Create_Date <- as.character.Date(cookgangdata$Subject_Create_Date)
cookgangdata$Subject_Approved_Date <- as.character.Date(cookgangdata$Subject_Approved_Date)
## Change all the "NULL" and "N/A" strings to NAs
cookgangdata[cookgangdata == "NULL"] <- NA
cookgangdata[cookgangdata == "N/A"] <- NA
## Change the character date fields back the Date data type
cookgangdata$Subject_Create_Date <- as.POSIXct(cookgangdata$Subject_Create_Date)
cookgangdata$Subject_Approved_Date <- as.POSIXct(cookgangdata$Subject_Approved_Date)
## Change the data types of weight and age variables
cookgangdata$Subject_Weight <- as.numeric(cookgangdata$Subject_Weight)
cookgangdata$Age <- as.numeric(cookgangdata$`Age as of 8/6/18`)
## Convert the height variable which is encoded in 3 characters as ft-in to inches as a numeric data type
cookgangdata$Subject_Height_In_Inches <- as.numeric(substr(cookgangdata$Subject_Height, 1, 1))*12 + as.numeric(substr(cookgangdata$Subject_Height, 2, 3))
## Convert the other variables to factors
cookgangdata$Subject_Sex <- as.factor(cookgangdata$Subject_Sex)
cookgangdata$address_city <- as.factor(cookgangdata$chicago)
cookgangdata$address_state <- as.factor(cookgangdata$address_state)
cookgangdata$address_zip <- as.factor(cookgangdata$address_zip)
cookgangdata$Subject_Gang_ID <- as.factor(cookgangdata$Subject_Gang_ID)
cookgangdata$Subject_Race_ID <- as.factor(cookgangdata$Subject_Race_ID)
cookgangdata$Subject_Eye_Color_ID <- as.factor(cookgangdata$Subject_Eye_Color_ID)
cookgangdata$Subject_Hair_Color_ID <- as.factor(cookgangdata$Subject_Hair_Color_ID)
cookgangdata$Subject_Felon <- as.factor(cookgangdata$Subject_Felon)
cookgangdata$Subject_Probation <- as.factor(cookgangdata$Subject_Probation)
cookgangdata$Subject_Admits_Gang <- as.factor(cookgangdata$Subject_Admits_Gang)
cookgangdata$Subject_Wears_Colors <- as.factor(cookgangdata$Subject_Wears_Colors)
cookgangdata$Subject_Armed <- as.factor(cookgangdata$Subject_Armed)
cookgangdata$Subject_Deceased <- as.factor(cookgangdata$Subject_Deceased)
cookgangdata$ArrestedWithGangMembers <- as.factor(cookgangdata$ArrestedWithGangMembers)
cookgangdata$IdentifiedAsGangMember <- as.factor(cookgangdata$IdentifiedAsGangMember)
cookgangdata$SelfAdmittedGangMember <- as.factor(cookgangdata$SelfAdmittedGangMember)
cookgangdata$HasGangTattoos <- as.factor(cookgangdata$HasGangTattoos)
cookgangdata$LiveInGangAreaOrKnowsGangMembers <- as.factor(cookgangdata$LiveInGangAreaOrKnowsGangMembers)
Note that some of the fields will not be used and are left in their original state, e.g. the ‘chicago’ field.
There are three numeric variables of interest: age, height, and weight. We’ll visualize these with histograms.
ggplot(cookgangdata, aes(x = Age)) + geom_histogram(binwidth = 2,na.rm = TRUE)
Age is unimodal, right-skewed, and has moderately wide distribution. It’s median is just above 25. It has a long tail out beyond 75.
p <- ggplot(cookgangdata, aes(x = Age)) + geom_histogram(binwidth = 2,na.rm = TRUE)
p + facet_wrap(. ~ Subject_Race_ID)
Break age distributions out by race, we see that Black, Hispanic, White, and NA members have similar distributions: unimodal and skewed right, though the White member distribution is considerably flatter than the others. Also, the White members have a higher median age, around 28 years old. Black, Hispanic, and NA members have a median age of about 25 years old.
ggplot(cookgangdata, aes(x = Subject_Weight)) + geom_histogram(binwidth = 10,na.rm = TRUE)
Weight is virtually unimodal, slightly right-skewed, and has a fairly tight distribution. It’s median is around 170 pounds, though has some outliers above 350 pounds.
ggplot(cookgangdata, aes(x = Subject_Height_In_Inches)) + geom_histogram(binwidth = 1,na.rm = TRUE)
Height is unimodal, slightly left-skewed, and tightly distributed. It’s median is about 68 inches (5’8“), but has some outliers down below 55 inches (4’7”) and above 80 inches (6’8").
p <- ggplot(cookgangdata, aes(x = Subject_Height_In_Inches))
p + geom_histogram(binwidth = 1,na.rm = TRUE) + facet_wrap(. ~ Subject_Race_ID)
Breaking the one histogram into several by race shows that Black, White, and Hispanic gang members all have bell-shaped curves - unimodal and symmetric. The median height for both Black and White members is about the same, around 70 inches. However, the median height for Hispanic members is a bit less, at about 66 inches. Other races don’t have enough members to provide meaningful insights,.
Also, there are a few categorical variables that are interesting, too: race, gender (Subject_Sex), gang affiliation (Subject_Gang_ID), is or isn’t a felon, is or isn’t on probation, identified as a gang member, self-identified as a gang member, arrested with gang members, armed, wears gang colors, and has tattoos. We’ll visualize these with bar graphs.
p <- ggplot(cookgangdata, aes(x = Subject_Race_ID))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The highest proportion races are Black, Hispanic, and White, with a high proportion of NAs.
p <- ggplot(cookgangdata, aes(x = Subject_Sex))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Yeah, mostly male gender here.
p <- ggplot(cookgangdata, aes(x = Subject_Gang_ID))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Whoa! A lot of different gangs. Let’s see if we can narrow them down to the top 10 or 20 by count in the database.
## Group by gang affiliation for 'by gang' statistics
i<- group_by(cookgangdata, Subject_Gang_ID) %>% summarise(n = n()) %>% arrange(desc(n))
## Warning: Factor `Subject_Gang_ID` contains implicit NA, consider using
## `forcats::fct_explicit_na`
summary(i)
## Subject_Gang_ID n
## 12th Street Players (People): 1 Min. : 1.0
## 156th Boys (Calumet City) : 1 1st Qu.: 1.0
## 18th Street Gang : 1 Median : 3.0
## 2-1 Nation (People) : 1 Mean : 57.8
## 211 Gang (South Bend) : 1 3rd Qu.: 17.0
## (Other) :426 Max. :4354.0
## NA's : 1
## Determine top 20 gangs by count in the database
top20 <- head(i[!is.na(i$Subject_Gang_ID), ], 20)
top20
## # A tibble: 20 x 2
## Subject_Gang_ID n
## <fct> <int>
## 1 Gangster Disciples (Folk) 4354
## 2 Latin Kings (People) 3650
## 3 Black P Stones (People) 1372
## 4 Four Corner Hustlers (People) 1190
## 5 Vice Lords (People) 1024
## 6 Two-Sixers (26ers) (Folk) 786
## 7 Black Disciples (Folk) 652
## 8 Deleted Duplicate Records (DO NOT USE) 638
## 9 Maniac Latin Disciples (Folk) 570
## 10 Traveling Vice Lords (People) 544
## 11 Conservative Vice Lords (People) 532
## 12 Satan Disciples (Folk) 527
## 13 Imperial Gangsters (Folk) 475
## 14 Spanish Gangster Disciples (Folk) 372
## 15 Haughville (Indianapolis) 325
## 16 Surenos (SUR-13) 304
## 17 Latin Counts (People) 294
## 18 Latin Dragons (Folk) 255
## 19 Black Gangsters/New Breed (Folk) 223
## 20 Fifth Avenue Boys (5th Avenue Boys) (Gary) (People) 215
## Determine top 10 gangs by count in the database
top10 <- head(i[!is.na(i$Subject_Gang_ID), ], 10)
#top10
## Determine top 10 gangs by count in the database
top5 <- head(i[!is.na(i$Subject_Gang_ID), ], 5)
#top5
p <- ggplot(cookgangdata[cookgangdata$Subject_Gang_ID %in% top20$Subject_Gang_ID,], aes(x = Subject_Gang_ID))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
The Gangster Disciples and Latin Kings are way above all of the rest at nearly 25% and 20% of the total, respectively.
Let’s see how the top 10 gangs (by gang membership) are split by the most represented racial groups: Black, Hispanic, and White.
p <- ggplot(cookgangdata[cookgangdata$Subject_Gang_ID %in% top10$Subject_Gang_ID & cookgangdata$Subject_Race_ID %in% c("Black", "Hispanic", "White"),], aes(x = Subject_Gang_ID))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total") + facet_wrap(. ~ Subject_Race_ID)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Across the top 10 gangs by member counts, it’s apparent that those gangs are made up of one race by far. For example, the Gangster Disciples, the largest gang, is majority Black, though there are some Hispanic and White members. Similarly, the Latin Kings is mostly Hispanic, though some members are White or Black. These can be contrasted with the Black Disciples and the Traveling Vice Lords which have 100% Black membership.
p <- ggplot(cookgangdata, aes(x = Subject_Felon))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Mostly NAs in this list, but also a good proportion of felons.
p <- ggplot(cookgangdata, aes(x = Subject_Probation))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Again, lots of NAs - too many to any meaningful insight.
p <- ggplot(cookgangdata, aes(x = IdentifiedAsGangMember))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Identified as a gang member or NA. Doesn’t seem meaningful either.
p <- ggplot(cookgangdata, aes(x = SelfAdmittedGangMember))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Not meaningful.
p <- ggplot(cookgangdata, aes(x = ArrestedWithGangMembers))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Not meaningful.
p <- ggplot(cookgangdata, aes(x = Subject_Armed))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
A bit more info, but still won’t provide much insight.
p <- ggplot(cookgangdata, aes(x = Subject_Wears_Colors))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
A large proportion wear gang colors, but also a large proportion are listed as NA.
p <- ggplot(cookgangdata, aes(x = HasGangTattoos))
p <- p + geom_bar(aes(y = (..count..)/sum(..count..)), na.rm = TRUE)
p <- p + ylab(label = "Proportion of total")
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Nearly evenly split between ‘Yes’ and NA. Not great.
Overall, the data quality of these category fields is pretty bad. Too much missing data.
So, I’ll see about generating some additional insight by putting some of the fields on a map, then try to visualize multiple variables together.
data("zipcode")
##zipcode
mapdata <- inner_join(x = cookgangdata, y = zipcode, by = c("address_zip" = "zip"))
## Warning: Column `address_zip`/`zip` joining factor and character vector,
## coercing into character vector
#Get the latest Install
#if(!requireNamespace("devtools")) install.packages("devtools")
#devtools::install_github("dkahle/ggmap", ref = "tidyup", force=TRUE)
#Set your API Key
#ggmap::register_google(key = "APIKEY")
p <- ggmap(get_googlemap(center = c(lon = -87.62255, lat = 41.69543), zoom = 10, scale = 1,
maptype ='terrain', color = 'color'))
## Source : https://maps.googleapis.com/maps/api/staticmap?center=41.69543,-87.62255&zoom=10&size=640x640&scale=1&maptype=terrain&key=xxx
p + geom_point(aes(x = longitude, y = latitude, colour = Subject_Race_ID),
data = mapdata, size = 0.5, na.rm = TRUE, position=position_jitter(h=0.03,w=0.03)) + theme(legend.position="right") + guides(colour = guide_legend(override.aes = list(size=10)))
To the south and east of the city, there are large concentrations of Black gang members. Hispanic gang members are more mostly concentrated north and west of Chicago, though there is also a concentration around the Indiana-Illinois border. Further out into the suburbs north, west, and south are locations for white gang members. Gang members from other ethnic groups do not have enough members to be apparent on the map.
col1 = "#011f4b"
col2 = "#6497b1"
col3 = "#b3cde0"
col4 = "#CC0000"
p + geom_point(aes(x = longitude, y = latitude, colour = Subject_Gang_ID), na.rm = TRUE,
data = mapdata[mapdata$Subject_Gang_ID %in% top5$Subject_Gang_ID,], alpha=0.25,
size = 0.5, position=position_jitter(h=0.03,w=0.03)) + theme(legend.position="right") + guides(colour = guide_legend(override.aes = list(size=10)))
It is hard to see the distribution of a few gangs on a map, so I’ve limited this map to where the members of the top 5 gangs live. Members of Gangster Disciples, the #1 gang by member numbers, live mostly south of downtown Chicago, with a concentration in Chicago Heights. Members of Latin Kings (#2) are found southeast and west of downtown Chicago. Black P Stones are south of downtown, while Four Corner Hustlers members are west. Vice Lords members are mostly southeast in Indiana.
p <- ggplot(data = subset(cookgangdata, !is.na(Age) & !is.na(Subject_Height_In_Inches)), aes(x = Age, y = Subject_Height_In_Inches))
p <- p + geom_point(na.rm = TRUE) + geom_smooth(method='lm',formula=y~x)
p <- p + ylab(label = "Subject Height in Inches") #+ facet_wrap(.~Subject_Race_ID)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Attempting to see if there is a correlation between multiple variables. First up is age vs. height. Fitting a line shows that there is a slight increase with height increasing as age increase. For ages less than 20, this makes sense as those that young could still be growing, but there may be other reasons why trends continue for peoples that are older.
p <- ggplot(data = subset(cookgangdata, !is.na(Age) & !is.na(Subject_Height_In_Inches)), aes(x = Age, y = Subject_Height_In_Inches))
p <- p + geom_point(na.rm = TRUE) + geom_smooth(method='lm',formula=y~x)
p <- p + ylab(label = "Subject Height in Inches") + facet_wrap(.~Subject_Race_ID)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in qt((1 - level)/2, df): NaNs produced
Breaking the population into racial groups, and then plotting age vs. height results in similar patterns for each racial group as the whole population, except for a couple (Indian and Native American) which don’t have that many data points.
p <- ggplot(data = subset(cookgangdata, !is.na(Age) & !is.na(Subject_Weight)), aes(x = Age, y = Subject_Weight))
p <- p + geom_point(na.rm = TRUE) #+ geom_smooth(method='glm',formula=y~x)
p <- p + ylab(label = "Subject Weight") #+ facet_wrap(.~Subject_Race_ID)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Now, plotting age vs. weight, it appears that there’s an direct correlation between the variables. Unfortunately, fitting a line doesn’t work properly probably because the weight numbers are discrete, not continuous, values.
p <- ggplot(data = subset(cookgangdata, !is.na(Age) & !is.na(Subject_Weight)), aes(x = Age, y = Subject_Weight))
p <- p + geom_point(na.rm = TRUE) #+ geom_smooth(method='glm',formula=y~x)
p <- p + ylab(label = "Subject Weight") + facet_wrap(.~Subject_Race_ID)
p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
Breaking the population in racial groups again results in similar positive correlation between age and weight, except for those racial groups with few data points.
The biggest conclusion I can draw is that the data quality of many of the categorical variables is so poor with such a high percentage of NAs among other issues that I did not use them in any multi-variable analysis. This severely limits the usefulness of those variables and limits the insights that can be gained from this dataset.
Beyond that, the next most important insight is that the data shows that most gang members are either Black or Hispanic, followed by white. The distribution of their height, weight, and age follow similar distributions and are probably similar to the population as a whole. Most gang members are male.
I am not that familiar with Chicago, but it appears that higher gang member density is in the city of Chicago, closer suburbs to the north, west, and south, and in the more industrial towns in Indiana close to Chicago like Gary. Gang member density is much less in suburbs that are further out to the north and west of the city. This likely reflects the racial demographics of the city and the suburbs.
citation("ggmap")
##
## To cite ggmap in publications, please use:
##
## D. Kahle and H. Wickham. ggmap: Spatial Visualization with
## ggplot2. The R Journal, 5(1), 144-161. URL
## http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {David Kahle and Hadley Wickham},
## title = {ggmap: Spatial Visualization with ggplot2},
## journal = {The R Journal},
## year = {2013},
## volume = {5},
## number = {1},
## pages = {144--161},
## url = {https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf},
## }